Create index from BAM w/
look at data w/bash
In order to perform the analysis you must have:
You can:
unimiPhD conda environmentconda activate unimiPhD
or
source activate unimiPhD
As a toy example we can download only the Human chromosome 19
wget ftp://ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz
gunzip Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz
And then the annotation in GTF format
wget ftp://ftp.ensembl.org/pub/release-95/gtf/homo_sapiens/Homo_sapiens.GRCh38.95.gtf.gz
Uncompress the file
gunzip Homo_sapiens.GRCh38.95.gtf.gz
STAR --runThreadN 4 --runMode genomeGenerate --genomeDir ./GenomeDir --sjdbGTFfile Homo_sapiens.GRCh38.95.gtf --genomeFastaFiles Homo_sapiens.GRCh38.dna.chromosome.19.fa
Feb 15 00:55:42 ..... started STAR run
Feb 15 00:55:42 ... starting to generate Genome files
Genome_genomeGenerate.cpp:208:genomeGenerate: exiting because of *OUTPUT FILE* error: could not create output file ./GenomeDir/chrName.txt
Solution: check that the path exists and you have write permission for this file
Feb 15 00:55:45 ...... FATAL ERROR, exiting
Command exited with non-zero status 109
mkdir GenomeDir
STAR --runThreadN 4 --runMode genomeGenerate --genomeDir ./GenomeDir --sjdbGTFfile Homo_sapiens.GRCh38.95.gtf --genomeFastaFiles Homo_sapiens.GRCh38.dna.chromosome.19.fa
Feb 15 00:56:27 ..... started STAR run
Feb 15 00:56:27 ... starting to generate Genome files
Feb 15 00:56:29 ... starting to sort Suffix Array. This may take a long time...
Feb 15 00:56:30 ... sorting Suffix Array chunks and saving them to disk...
Feb 15 00:57:49 ... loading chunks from disk, packing SA...
Feb 15 00:57:53 ... finished generating suffix array
Feb 15 00:57:53 ... generating Suffix Array index
Feb 15 00:58:49 ... completed Suffix Array index
Feb 15 00:58:49 ..... processing annotations GTF
Feb 15 00:59:07 ..... inserting junctions into the genome indices
Feb 15 00:59:59 ... writing Genome to disk ...
Feb 15 00:59:59 ... writing Suffix Array to disk ...
Feb 15 01:00:04 ... writing SAindex to disk
Feb 15 01:00:18 ..... finished successfully
curl -o ERR431583_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR431/ERR431583/ERR431583_1.fastq.gz
curl -o ERR431583_2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR431/ERR431583/ERR431583_2.fastq.gz
From the local server
scp user@192.168.200.213:ERR431583_1.fastq.gz ERR431583_1.fastq.gz
scp user@192.168.200.213:ERR431583_2.fastq.gz ERR431583_2.fastq.gz
fastqc -t 2 ERR431583_1.fastq.gz ERR431583_2.fastq.gz
zcat ERR431583_1.fastq.gz | wc -l
$ 58937804
zcat ERR431583_2.fastq.gz | wc -l
$ 58937804
that is the number of lines of each file.
58,937,804/4 = 14,734,451
Grap the first 1mln reads
zcat ERR431583_1.fastq.gz | head -n 4000000 | gzip > ERR431583_downsize_1.fastq.gz
zcat ERR431583_2.fastq.gz | head -n 4000000 | gzip > ERR431583_downsize_2.fastq.gz
trimmomatic PE -threads 4 -phred33 ERR431583_downsize_1.fastq.gz ERR431583_downsize_2.fastq.gz R1_P.fastq.gz R1_U.fastq.gz R2_P.fastq.gz R2_U.fastq.gz ILLUMINACLIP:${CONDA_PREFIX}/share/trimmomatic-0.38-1/adapters/NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Input Read Pairs: 15494812 Both Surviving: 14734451 (95.09%) Forward Only Surviving: 664700 (4.29%) Reverse Only Surviving: 55203 (0.36%) Dropped: 40458 (0.26%)
TrimmomaticPE: Completed successfully
STAR --genomeDir ./GenomeDir \
--runThreadN 4 \
--readFilesIn R1_P.fastq.gz R2_P.fastq.gz \
--readFilesCommand zcat \
--genomeLoad LoadAndRemove \
--outFileNamePrefix MySample_ \
--outReadsUnmapped Fastx \
--outSAMstrandField intronMotif \
--outFilterIntronMotifs RemoveNoncanonicalUnannotated \
--quantMode GeneCounts \
--outSAMtype BAM SortedByCoordinate \
--limitBAMsortRAM 5000000000
samtools sort -o MySample_SortedByName.bam -O bam -n -@ 4 BAMFILE
samtools index MySample_SortedByName.bam